Markus Harrer, Software Development Analyst
@feststelltaste
https://feststelltaste.de
JUG Nürnberg, 28.02.2019

"Statistics on a Mac."
=> Deliver robust insights backed by facts
"The aim of science is to seek the simplest explanations of complex facts."
=> Work out new insights in an understandable way
"Someone who knows more about statistics
than a software developer
and more about software development
than a statistician."
Data Science & Software Data: Perfect match!
=> A crazy amount!
=> From the problem via the data to the insight!
I. Idea
II. Load
III. Filter
IV. Join
V. Aggregate // fun part
VI. Visualize // not so fun part, but...
Rather secondary... used here for the most part:
But other things work too...
// :-) Meta goal: see everything once, using a simple showcase.
Question 1: Are there modules with team monopolies?
Our heuristic: Is a module changed mainly by just one team?
from ozapfdis import git
log = git.log_numstat("../../../dropover/")
log.head(3)
| | additions | deletions | file | sha | timestamp | author |
|---|---|---|---|---|---|---|
| 0 | 191 | 0 | backend/pom-2016-07-16_04-40-56-752.xml | 8c686954 | 2016-07-22 17:43:38 | Michael |
| 1 | 1 | 1 | backend/src/test/java/at/dropover/scheduling/i... | 97c6ef96 | 2016-07-16 09:51:15 | Markus |
| 2 | 19 | 3 | backend/src/main/webapp/app/widgets/gallery/js... | 3f7cf92c | 2016-07-16 09:07:31 | Markus |
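The `git.log_numstat` helper comes from the ozapfdis library, and its internals are not shown on the slides. As a rough sketch of the idea, assuming the raw input looks like the output of `git log --numstat` with one custom header line per commit, such a log could be turned into a DataFrame as follows. The `--` marker format, the sample text and the function name `parse_numstat` are made up for illustration and are not the library's implementation.

```python
import pandas as pd

# Hypothetical sample, shaped roughly like the output of
#   git log --numstat --pretty=format:"--%h--%ad--%an"
# File lines are tab-separated: additions, deletions, path.
SAMPLE_LOG = (
    "--aaaa1111--2020-01-02 10:00:00--alice\n"
    "191\t0\tbackend/pom.xml\n"
    "\n"
    "--bbbb2222--2020-01-01 09:00:00--bob\n"
    "1\t1\tbackend/src/test/java/demo/FooTest.java\n"
    "-\t-\tdocs/logo.png\n"
)


def parse_numstat(text):
    """Return one row per changed file, enriched with its commit's metadata."""
    rows = []
    sha = timestamp = author = None
    for line in text.splitlines():
        if line.startswith("--"):
            # Commit header: remember the metadata for the file lines below it.
            _, sha, timestamp, author = line.split("--", 3)
        elif line.strip():
            additions, deletions, path = line.split("\t", 2)
            rows.append((additions, deletions, path, sha, timestamp, author))
    log = pd.DataFrame(
        rows,
        columns=["additions", "deletions", "file", "sha", "timestamp", "author"])
    log["timestamp"] = pd.to_datetime(log["timestamp"])
    return log


log = parse_numstat(SAMPLE_LOG)
```

Each row describes one file touched by one commit, which is the shape that the later filter, join and aggregation steps build on.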
What do we actually have here?
log.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2403 entries, 0 to 2402
Data columns (total 6 columns):
additions    2403 non-null object
deletions    2403 non-null object
file         2403 non-null object
sha          2403 non-null object
timestamp    2403 non-null datetime64[ns]
author       2403 non-null object
dtypes: datetime64[ns](1), object(5)
memory usage: 112.7+ KB
1 DataFrame (~ programmable Excel worksheet), 6 Series (= columns), 2403 rows (= entries)
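The `info()` output lists `additions` and `deletions` as `object`, i.e. as text rather than numbers. One plausible reason is that `git log --numstat` prints `-` instead of a count for binary files. The analysis in this talk does not need the counts, but if they were needed for arithmetic, a conversion along these lines might help. The small `raw` frame is made-up sample data.

```python
import pandas as pd

# Made-up sample: counts as text, "-" standing in for a binary file.
raw = pd.DataFrame({
    "additions": ["191", "1", "-"],
    "deletions": ["0", "1", "-"],
})

numeric = raw.copy()
for column in ["additions", "deletions"]:
    # errors="coerce" turns anything that is not a number into NaN.
    numeric[column] = pd.to_numeric(numeric[column], errors="coerce")

total_additions = numeric["additions"].sum()  # NaN entries are skipped
```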
We continue with Java production code only.
java = log.copy()
java = java[java['file'].str.startswith("backend/src/main/java")]
java = java[~java['file'].str.contains("package-info.java")]
java.head(3)
| | additions | deletions | file | sha | timestamp | author |
|---|---|---|---|---|---|---|
| 4 | 3 | 4 | backend/src/main/java/at/dropover/files/intera... | ec85fe73 | 2016-07-16 08:12:29 | Chris |
| 36 | 3 | 2 | backend/src/main/java/at/dropover/scheduling/i... | bfea33b8 | 2016-07-16 02:02:02 | Markus |
| 44 | 23 | 1 | backend/src/main/java/at/dropover/scheduling/i... | ab9ad48e | 2016-07-16 00:50:20 | Chris |
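The two filter lines rely on boolean masks: each `.str` call yields one `True`/`False` value per row, and `~` negates a mask. A minimal sketch on made-up file names shows the effect. One detail worth knowing: `str.contains` treats its pattern as a regular expression by default, so the `.` in `package-info.java` matches any character; passing `regex=False` asks for a literal match instead.

```python
import pandas as pd

# Made-up file names for illustration.
files = pd.DataFrame({"file": [
    "backend/src/main/java/demo/Foo.java",
    "backend/src/main/java/demo/package-info.java",
    "backend/src/test/java/demo/FooTest.java",
    "backend/pom.xml",
]})

is_production = files["file"].str.startswith("backend/src/main/java")
is_package_info = files["file"].str.contains("package-info.java", regex=False)

# Keep production code, drop the package-info files.
production = files[is_production & ~is_package_info]
```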
We map committers to teams.
Step 1: Read in an additional data source.
import pandas as pd
orga = pd.read_excel("../dataset/Teamorganisation.xlsx", index_col=0)
orga
| | team |
|---|---|
| name | |
| Markus | A |
| Chris | B |
| Michael | C |
We map committers to teams.
Step 2: Join the data sources
java = java.join(orga, on="author")
java
| | additions | deletions | file | sha | timestamp | author | team |
|---|---|---|---|---|---|---|---|
| 4 | 3 | 4 | backend/src/main/java/at/dropover/files/intera... | ec85fe73 | 2016-07-16 08:12:29 | Chris | B |
| 36 | 3 | 2 | backend/src/main/java/at/dropover/scheduling/i... | bfea33b8 | 2016-07-16 02:02:02 | Markus | A |
| 44 | 23 | 1 | backend/src/main/java/at/dropover/scheduling/i... | ab9ad48e | 2016-07-16 00:50:20 | Chris | B |
| 47 | 68 | 6 | backend/src/main/java/at/dropover/files/intera... | 0732e9cb | 2016-07-16 00:27:20 | Chris | B |
| 53 | 7 | 3 | backend/src/main/java/at/dropover/framework/co... | ba1fd215 | 2016-07-15 22:51:46 | Chris | B |
| 54 | 15 | 7 | backend/src/main/java/at/dropover/mail/entity/... | ba1fd215 | 2016-07-15 22:51:46 | Chris | B |
| 55 | 27 | 26 | backend/src/main/java/at/dropover/mail/entity/... | ba1fd215 | 2016-07-15 22:51:46 | Chris | B |
| 56 | 11 | 9 | backend/src/main/java/at/dropover/mail/entity/... | ba1fd215 | 2016-07-15 22:51:46 | Chris | B |
| 57 | 1 | 1 | backend/src/main/java/at/dropover/mail/interac... | ba1fd215 | 2016-07-15 22:51:46 | Chris | B |
| 62 | 6 | 1 | backend/src/main/java/at/dropover/scheduling/i... | 60fcf2c9 | 2016-07-15 22:22:16 | Chris | B |
| 63 | 1 | 0 | backend/src/main/java/at/dropover/scheduling/b... | 0937d598 | 2016-07-15 22:18:48 | Chris | B |
| 64 | 1 | 1 | backend/src/main/java/at/dropover/scheduling/i... | 0937d598 | 2016-07-15 22:18:48 | Chris | B |
| 66 | 1 | 0 | backend/src/main/java/at/dropover/scheduling/b... | 432113a2 | 2016-07-15 21:17:07 | Chris | B |
| 67 | 9 | 0 | backend/src/main/java/at/dropover/scheduling/b... | 432113a2 | 2016-07-15 21:17:07 | Chris | B |
| 68 | 18 | 0 | backend/src/main/java/at/dropover/scheduling/d... | 432113a2 | 2016-07-15 21:17:07 | Chris | B |
| 69 | 2 | 0 | backend/src/main/java/at/dropover/scheduling/e... | 432113a2 | 2016-07-15 21:17:07 | Chris | B |
| 70 | 1 | 0 | backend/src/main/java/at/dropover/scheduling/e... | 432113a2 | 2016-07-15 21:17:07 | Chris | B |
| 71 | 5 | 1 | backend/src/main/java/at/dropover/scheduling/e... | 432113a2 | 2016-07-15 21:17:07 | Chris | B |
| 72 | 26 | 0 | backend/src/main/java/at/dropover/scheduling/e... | 432113a2 | 2016-07-15 21:17:07 | Chris | B |
| 73 | 3 | 0 | backend/src/main/java/at/dropover/scheduling/e... | 432113a2 | 2016-07-15 21:17:07 | Chris | B |
| 74 | 21 | 3 | backend/src/main/java/at/dropover/scheduling/e... | 432113a2 | 2016-07-15 21:17:07 | Chris | B |
| 75 | 11 | 0 | backend/src/main/java/at/dropover/scheduling/e... | 432113a2 | 2016-07-15 21:17:07 | Chris | B |
| 76 | 11 | 0 | backend/src/main/java/at/dropover/scheduling/e... | 432113a2 | 2016-07-15 21:17:07 | Chris | B |
| 77 | 2 | 1 | backend/src/main/java/at/dropover/scheduling/e... | 432113a2 | 2016-07-15 21:17:07 | Chris | B |
| 78 | 28 | 0 | backend/src/main/java/at/dropover/scheduling/i... | 432113a2 | 2016-07-15 21:17:07 | Chris | B |
| 88 | 1 | 1 | backend/src/main/java/at/dropover/framework/co... | fd886df3 | 2016-07-15 19:05:44 | Markus | A |
| 92 | 2 | 2 | backend/src/main/java/at/dropover/framework/co... | c5e87fba | 2016-07-15 17:42:12 | Markus | A |
| 94 | 2 | 2 | backend/src/main/java/at/dropover/comment/deli... | a526cce1 | 2016-07-15 16:40:07 | Chris | B |
| 95 | 2 | 2 | backend/src/main/java/at/dropover/comment/deli... | a526cce1 | 2016-07-15 16:40:07 | Chris | B |
| 96 | 7 | 7 | backend/src/main/java/at/dropover/comment/enti... | a526cce1 | 2016-07-15 16:40:07 | Chris | B |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2276 | 1 | 2 | backend/src/main/java/at/dropover/todo/boundar... | bfb3dcd0 | 2013-03-29 23:05:29 | Chris | B |
| 2277 | 1 | 1 | backend/src/main/java/at/dropover/todo/boundar... | bfb3dcd0 | 2013-03-29 23:05:29 | Chris | B |
| 2278 | 1 | 1 | backend/src/main/java/at/dropover/todo/boundar... | bfb3dcd0 | 2013-03-29 23:05:29 | Chris | B |
| 2279 | 2 | 2 | backend/src/main/java/at/dropover/todo/deliver... | bfb3dcd0 | 2013-03-29 23:05:29 | Chris | B |
| 2280 | 1 | 1 | backend/src/main/java/at/dropover/todo/deliver... | bfb3dcd0 | 2013-03-29 23:05:29 | Chris | B |
| 2281 | 2 | 2 | backend/src/main/java/at/dropover/todo/entity/... | bfb3dcd0 | 2013-03-29 23:05:29 | Chris | B |
| 2282 | 3 | 3 | backend/src/main/java/at/dropover/todo/interac... | bfb3dcd0 | 2013-03-29 23:05:29 | Chris | B |
| 2283 | 5 | 5 | backend/src/main/java/at/dropover/todo/interac... | bfb3dcd0 | 2013-03-29 23:05:29 | Chris | B |
| 2284 | 2 | 2 | backend/src/main/java/at/dropover/todo/interac... | bfb3dcd0 | 2013-03-29 23:05:29 | Chris | B |
| 2285 | 5 | 5 | backend/src/main/java/at/dropover/todo/interac... | bfb3dcd0 | 2013-03-29 23:05:29 | Chris | B |
| 2286 | 11 | 0 | backend/src/main/java/at/dropover/todo/interac... | bfb3dcd0 | 2013-03-29 23:05:29 | Chris | B |
| 2287 | 4 | 4 | backend/src/main/java/at/dropover/todo/interac... | bfb3dcd0 | 2013-03-29 23:05:29 | Chris | B |
| 2290 | 5 | 0 | backend/src/main/java/at/dropover/todo/boundar... | 0f17d3c0 | 2013-03-29 22:00:58 | Chris | B |
| 2291 | 9 | 0 | backend/src/main/java/at/dropover/todo/boundar... | 0f17d3c0 | 2013-03-29 22:00:58 | Chris | B |
| 2292 | 11 | 0 | backend/src/main/java/at/dropover/todo/boundar... | 0f17d3c0 | 2013-03-29 22:00:58 | Chris | B |
| 2293 | 85 | 0 | backend/src/main/java/at/dropover/todo/entity/... | 0f17d3c0 | 2013-03-29 22:00:58 | Chris | B |
| 2294 | 38 | 0 | backend/src/main/java/at/dropover/todo/interac... | 0f17d3c0 | 2013-03-29 22:00:58 | Chris | B |
| 2295 | 10 | 0 | backend/src/main/java/at/dropover/todo/boundar... | 21fbc9f1 | 2013-03-29 22:00:27 | Chris | B |
| 2296 | 10 | 0 | backend/src/main/java/at/dropover/todo/boundar... | 21fbc9f1 | 2013-03-29 22:00:27 | Chris | B |
| 2297 | 6 | 0 | backend/src/main/java/at/dropover/todo/boundar... | 21fbc9f1 | 2013-03-29 22:00:27 | Chris | B |
| 2298 | 10 | 0 | backend/src/main/java/at/dropover/todo/boundar... | 21fbc9f1 | 2013-03-29 22:00:27 | Chris | B |
| 2299 | 100 | 0 | backend/src/main/java/at/dropover/todo/deliver... | 21fbc9f1 | 2013-03-29 22:00:27 | Chris | B |
| 2300 | 70 | 0 | backend/src/main/java/at/dropover/todo/deliver... | 21fbc9f1 | 2013-03-29 22:00:27 | Chris | B |
| 2301 | 13 | 0 | backend/src/main/java/at/dropover/todo/entity/... | 21fbc9f1 | 2013-03-29 22:00:27 | Chris | B |
| 2302 | 44 | 0 | backend/src/main/java/at/dropover/todo/entity/... | 21fbc9f1 | 2013-03-29 22:00:27 | Chris | B |
| 2303 | 69 | 0 | backend/src/main/java/at/dropover/todo/entity/... | 21fbc9f1 | 2013-03-29 22:00:27 | Chris | B |
| 2304 | 61 | 0 | backend/src/main/java/at/dropover/todo/interac... | 21fbc9f1 | 2013-03-29 22:00:27 | Chris | B |
| 2305 | 76 | 0 | backend/src/main/java/at/dropover/todo/interac... | 21fbc9f1 | 2013-03-29 22:00:27 | Chris | B |
| 2306 | 62 | 0 | backend/src/main/java/at/dropover/todo/interac... | 21fbc9f1 | 2013-03-29 22:00:27 | Chris | B |
| 2307 | 76 | 0 | backend/src/main/java/at/dropover/todo/interac... | 21fbc9f1 | 2013-03-29 22:00:27 | Chris | B |
1191 rows × 7 columns
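`DataFrame.join` with `on="author"` looks up each row's author in the index of the other frame and performs a left join by default. A small sketch, using the team mapping from the slides plus one made-up committer ("Dana") who is missing from the mapping, shows what that means for unmapped authors: they stay in the result with a missing team.

```python
import pandas as pd

# Made-up commit rows; "Dana" is a hypothetical committer without a team.
commits = pd.DataFrame({
    "sha": ["c1", "c2", "c3"],
    "author": ["Markus", "Chris", "Dana"],
})

# Team mapping as shown on the slide, indexed by committer name.
orga = pd.DataFrame(
    {"team": ["A", "B", "C"]},
    index=pd.Index(["Markus", "Chris", "Michael"], name="name"))

joined = commits.join(orga, on="author")

# Rows whose author has no team assignment.
unmapped = joined[joined["team"].isna()]
```

Checking `unmapped` after a join is a cheap way to notice committers that the team spreadsheet does not cover yet.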
We group the data by module (= part of the package name).
java['module'] = java['file'].str.split("/").str[6]
java.head()
| | additions | deletions | file | sha | timestamp | author | team | module |
|---|---|---|---|---|---|---|---|---|
| 4 | 3 | 4 | backend/src/main/java/at/dropover/files/intera... | ec85fe73 | 2016-07-16 08:12:29 | Chris | B | files |
| 36 | 3 | 2 | backend/src/main/java/at/dropover/scheduling/i... | bfea33b8 | 2016-07-16 02:02:02 | Markus | A | scheduling |
| 44 | 23 | 1 | backend/src/main/java/at/dropover/scheduling/i... | ab9ad48e | 2016-07-16 00:50:20 | Chris | B | scheduling |
| 47 | 68 | 6 | backend/src/main/java/at/dropover/files/intera... | 0732e9cb | 2016-07-16 00:27:20 | Chris | B | files |
| 53 | 7 | 3 | backend/src/main/java/at/dropover/framework/co... | ba1fd215 | 2016-07-15 22:51:46 | Chris | B | framework |
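The module name is simply the seventh path segment (index 6) after splitting on `/`. A short sketch with made-up class names illustrates this, and also what happens to paths that are too short to have such a segment.

```python
import pandas as pd

# Made-up paths; only the directory layout follows the slides.
paths = pd.Series([
    "backend/src/main/java/at/dropover/files/Example.java",
    "backend/src/main/java/at/dropover/scheduling/sub/Example.java",
    "backend/pom.xml",
])

# Segments: backend(0)/src(1)/main(2)/java(3)/at(4)/dropover(5)/<module>(6)/...
modules = paths.str.split("/").str[6]
```

Paths with fewer than seven segments yield a missing value instead of raising an error, which is one reason to filter for production code first.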
We mark each file change with a new flag.
java["changes"] = 1
java.head(3)
| | additions | deletions | file | sha | timestamp | author | team | module | changes |
|---|---|---|---|---|---|---|---|---|---|
| 4 | 3 | 4 | backend/src/main/java/at/dropover/files/intera... | ec85fe73 | 2016-07-16 08:12:29 | Chris | B | files | 1 |
| 36 | 3 | 2 | backend/src/main/java/at/dropover/scheduling/i... | bfea33b8 | 2016-07-16 02:02:02 | Markus | A | scheduling | 1 |
| 44 | 23 | 1 | backend/src/main/java/at/dropover/scheduling/i... | ab9ad48e | 2016-07-16 00:50:20 | Chris | B | scheduling | 1 |
We sum up the class changes per module and team.
changes = java.groupby(['module', "team"])[['changes']].sum()
changes.head()
| module | team | changes |
|---|---|---|
| comment | A | 41 |
| | B | 76 |
| | C | 1 |
| creator | A | 14 |
| | B | 33 |
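Summing a column of ones per group is the same as counting the rows per group. A minimal sketch on made-up rows compares the flag-and-sum approach from the slides with `groupby(...).size()`, which counts directly.

```python
import pandas as pd

# Made-up change rows: one row per changed file per commit.
java = pd.DataFrame({
    "module": ["comment", "comment", "comment", "creator"],
    "team": ["A", "A", "B", "B"],
})

# Approach from the slides: flag every row with 1, then sum per group.
java["changes"] = 1
changes = java.groupby(["module", "team"])[["changes"]].sum()

# Alternative: count rows per group without a helper column.
sizes = java.groupby(["module", "team"]).size()
```

The flag column has the advantage that the result is already a named DataFrame column to which further columns can be added.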
We compute all changes made per module...
changes['all'] = changes.groupby('module')['changes'].transform('sum')
changes.head(3)
| module | team | changes | all |
|---|---|---|---|
| comment | A | 41 | 118 |
| | B | 76 | 118 |
| | C | 1 | 118 |
...and from that the change ratios per team.
changes['ratio'] = changes['changes'] / changes['all']
changes.head()
| module | team | changes | all | ratio |
|---|---|---|---|---|
| comment | A | 41 | 118 | 0.347458 |
| | B | 76 | 118 | 0.644068 |
| | C | 1 | 118 | 0.008475 |
| creator | A | 14 | 54 | 0.259259 |
| | B | 33 | 54 | 0.611111 |
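Unlike `sum()`, `transform('sum')` keeps one value per original row, so the module total is repeated for every team of that module and can be used directly as the denominator. The sketch below replays this for the `comment` module with the counts shown in the table (41, 76 and 1).

```python
import pandas as pd

# Counts for the "comment" module, taken from the table on the slide.
changes = pd.DataFrame(
    {"changes": [41, 76, 1]},
    index=pd.MultiIndex.from_tuples(
        [("comment", "A"), ("comment", "B"), ("comment", "C")],
        names=["module", "team"]))

# Module total, broadcast back onto every (module, team) row.
changes["all"] = changes.groupby("module")["changes"].transform("sum")

# Share of each team in the changes of its module.
changes["ratio"] = changes["changes"] / changes["all"]
```

Within each module the ratios add up to 1, which is what makes the stacked bar chart comparable across modules.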
We build a bar chart with the ratios of the team changes.
changes['ratio'].unstack().plot.bar(stacked=True);
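Besides reading the chart, the unstacked ratios can also be queried directly for candidate monopolies. The following is only a sketch: the module names, the ratio values and the 0.8 threshold are assumptions chosen for illustration, not figures from the analysed project.

```python
import pandas as pd

# Made-up ratios in the shape produced by changes['ratio'].unstack():
# one row per module, one column per team.
ratios = pd.DataFrame(
    {"A": [0.90, 0.35], "B": [0.10, 0.40], "C": [0.00, 0.25]},
    index=pd.Index(["module_x", "module_y"], name="module"))

MONOPOLY_THRESHOLD = 0.8  # assumption: what counts as "mainly one team"

dominant_team = ratios.idxmax(axis=1)   # team with the largest share
dominant_share = ratios.max(axis=1)     # size of that share

monopolies = dominant_share[dominant_share >= MONOPOLY_THRESHOLD]
```

Where exactly to draw the line is a judgment call; the chart helps to pick a threshold that fits the project.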
Question 2: How well does the modularization fit the team?
Our heuristic: Are domain components changed together as a unit?
We analyze all changes for all files (~ punch card).
commit_matrix = java.pivot(index='file', columns='sha', values='changes').fillna(0)
commit_matrix.iloc[:5,50:55]
| sha | 3597d8a2 | 3b70ea7e | 3d3be4ca | 3e4ae692 | 429b3b32 |
|---|---|---|---|---|---|
| file | |||||
| backend/src/main/java/at/dropover/comment/boundary/AddCommentRequestModel.java | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| backend/src/main/java/at/dropover/comment/boundary/ChangeCommentRequestModel.java | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| backend/src/main/java/at/dropover/comment/boundary/CommentData.java | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| backend/src/main/java/at/dropover/comment/boundary/GetCommentRequestModel.java | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| backend/src/main/java/at/dropover/comment/boundary/GetCommentResponseModel.java | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
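The pivot turns the long list of changes into a wide matrix with one row per file and one column per commit. A toy example with made-up file names and commit ids shows the reshaping. Note that `pivot` expects each (file, sha) pair to occur at most once and raises an error otherwise.

```python
import pandas as pd

# Made-up change rows: Foo.java changed in c1 and c2, Bar.java only in c1.
java = pd.DataFrame({
    "file": ["Foo.java", "Bar.java", "Foo.java"],
    "sha": ["c1", "c1", "c2"],
    "changes": 1,
})

# Rows: files, columns: commits, cells: 1 if the commit touched the file.
commit_matrix = java.pivot(
    index="file", columns="sha", values="changes").fillna(0)
```

Each row of `commit_matrix` is the "punch card" of one file and serves as its vector in the distance calculation.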
We compute the distance between the commits made to each file (= vector)...
from sklearn.metrics.pairwise import cosine_distances
dis_matrix = cosine_distances(commit_matrix)
dis_matrix[:5,:5]
array([[0. , 0.29289322, 0.5 , 0.18350342, 0.29289322],
[0.29289322, 0. , 0.29289322, 0.1339746 , 0.5 ],
[0.5 , 0.29289322, 0. , 0.59175171, 0.29289322],
[0.18350342, 0.1339746 , 0.59175171, 0. , 0.42264973],
[0.29289322, 0.5 , 0.29289322, 0.42264973, 0. ]])
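The cosine distance between two vectors a and b is 1 - (a · b) / (|a| · |b|): 0 means the two files were always changed in the same commits, 1 means they never shared a commit. A small NumPy sketch with made-up change vectors makes the numbers tangible; the helper name `cosine_distance` is chosen here for illustration.

```python
import numpy as np


def cosine_distance(a, b):
    """1 - cosine similarity of two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))


# Made-up change vectors: one entry per commit, 1 = file was changed.
always_together = cosine_distance([1, 1, 0, 0], [1, 1, 0, 0])
half_overlap = cosine_distance([1, 1, 0, 0], [0, 1, 1, 0])
subset = cosine_distance([1, 0], [1, 1])
never_together = cosine_distance([1, 1, 0, 0], [0, 0, 1, 1])
```

Here `half_overlap` comes out as 0.5 and `subset` as 1 - 1/√2 ≈ 0.2929, the kind of values that also show up in the distance matrix.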
...and make it prettier...
dis_df = pd.DataFrame(
    dis_matrix,
    index=commit_matrix.index,
    columns=commit_matrix.index)
dis_df.iloc[:5,:2]
| file | backend/src/main/java/at/dropover/comment/boundary/AddCommentRequestModel.java | backend/src/main/java/at/dropover/comment/boundary/ChangeCommentRequestModel.java |
|---|---|---|
| file | ||
| backend/src/main/java/at/dropover/comment/boundary/AddCommentRequestModel.java | 0.000000 | 0.292893 |
| backend/src/main/java/at/dropover/comment/boundary/ChangeCommentRequestModel.java | 0.292893 | 0.000000 |
| backend/src/main/java/at/dropover/comment/boundary/CommentData.java | 0.500000 | 0.292893 |
| backend/src/main/java/at/dropover/comment/boundary/GetCommentRequestModel.java | 0.183503 | 0.133975 |
| backend/src/main/java/at/dropover/comment/boundary/GetCommentResponseModel.java | 0.292893 | 0.500000 |
...and visualize the intermediate result.
import seaborn
ax = seaborn.heatmap(dis_df, xticklabels=False, yticklabels=False);
Next: We now reduce the multi-dimensional matrix to an (almost equivalent) 2D representation...
from sklearn.manifold import MDS
model = MDS(dissimilarity='precomputed', random_state=0)
dis_2d = model.fit_transform(dis_df)
dis_2d[:5]
array([[-0.5259277 , 0.45070158],
[-0.56826041, 0.21528001],
[-0.52746829, 0.34756761],
[-0.55856713, 0.26202797],
[-0.4036568 , 0.49803657]])
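To get a feeling for what multidimensional scaling does, here is a sketch of classical (Torgerson) MDS in plain NumPy. This is deliberately a different, simpler algorithm than the iterative SMACOF optimisation used by scikit-learn's `MDS`, so it will not reproduce the numbers above. It only illustrates the goal: find 2D coordinates whose pairwise distances match a given distance matrix as well as possible. The function name and the sample points are made up.

```python
import numpy as np


def classical_mds(distances, n_components=2):
    """Embed points so that their Euclidean distances approximate `distances`."""
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    # Double-centre the squared distances to obtain a Gram (inner product) matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    # The largest eigenvalues and their eigenvectors span the best low-dim embedding.
    eigenvalues, eigenvectors = np.linalg.eigh(gram)
    top = np.argsort(eigenvalues)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigenvalues[top], 0, None))
    return eigenvectors[:, top] * scale


# Made-up example: four corners of a 3 x 4 rectangle.
points = np.array([[0, 0], [3, 0], [0, 4], [3, 4]], dtype=float)
distances = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

coords = classical_mds(distances)
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
```

For this toy input the recovered distances equal the original ones; the coordinates themselves may be rotated or mirrored, which is why only relative positions in such a plot carry meaning.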
...and once again make it prettier...
dis_2d_df = pd.DataFrame(
    dis_2d,
    index=commit_matrix.index,
    columns=["x", "y"])
dis_2d_df.head()
| | x | y |
|---|---|---|
| file | | |
| backend/src/main/java/at/dropover/comment/boundary/AddCommentRequestModel.java | -0.525928 | 0.450702 |
| backend/src/main/java/at/dropover/comment/boundary/ChangeCommentRequestModel.java | -0.568260 | 0.215280 |
| backend/src/main/java/at/dropover/comment/boundary/CommentData.java | -0.527468 | 0.347568 |
| backend/src/main/java/at/dropover/comment/boundary/GetCommentRequestModel.java | -0.558567 | 0.262028 |
| backend/src/main/java/at/dropover/comment/boundary/GetCommentResponseModel.java | -0.403657 | 0.498037 |
...including the modules.
dis_2d_df['module'] = dis_2d_df.index.str.split("/").str[6]
dis_2d_df.head()
| | x | y | module |
|---|---|---|---|
| file | | | |
| backend/src/main/java/at/dropover/comment/boundary/AddCommentRequestModel.java | -0.525928 | 0.450702 | comment |
| backend/src/main/java/at/dropover/comment/boundary/ChangeCommentRequestModel.java | -0.568260 | 0.215280 | comment |
| backend/src/main/java/at/dropover/comment/boundary/CommentData.java | -0.527468 | 0.347568 | comment |
| backend/src/main/java/at/dropover/comment/boundary/GetCommentRequestModel.java | -0.558567 | 0.262028 | comment |
| backend/src/main/java/at/dropover/comment/boundary/GetCommentResponseModel.java | -0.403657 | 0.498037 | comment |
We create an interactive chart.
Little helper: An Unifying Software Integrator
from ausi import pygal
xy = pygal.create_xy_chart(dis_2d_df, "module")
xy.render_in_browser()
Question 3: Is there an alternative modularization?
Our heuristic: How would the system be structured purely according to its changes?
We use hierarchical clustering to detect alternative module structures based on the change patterns...
from sklearn.cluster import AgglomerativeClustering
clustering = AgglomerativeClustering()
model = clustering.fit(commit_matrix)
model
AgglomerativeClustering(affinity='euclidean', compute_full_tree='auto',
connectivity=None, linkage='ward', memory=None, n_clusters=2,
pooling_func='deprecated')
...and visualize the result.
from ausi.scipy import plot_dendrogram
plot_dendrogram(model, labels=commit_matrix.index)
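The `plot_dendrogram` helper from ausi is not reproduced here. As a stand-in, the following sketch uses SciPy's hierarchical clustering on a made-up commit matrix, with Ward linkage to mirror the `linkage='ward'` setting shown in the model output. It is an illustration of the idea, not the talk's actual code path.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Made-up commit matrix: rows = files, columns = commits.
# Files 0 and 1 change together, files 2 and 3 mostly change together.
commit_matrix = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
], dtype=float)

# Bottom-up merging of the most similar files and clusters (Ward linkage).
merges = linkage(commit_matrix, method="ward")

# Cut the merge tree into two clusters.
labels = fcluster(merges, t=2, criterion="maxclust")

# scipy.cluster.hierarchy.dendrogram(merges) would draw the merge tree.
```

Files that end up in the same cluster are candidates for belonging to the same module, regardless of the package they currently live in.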
https://upload.wikimedia.org/wikipedia/commons/thumb/7/71/062-exploding-head.svg/600px-062-exploding-head.svg.png
1. Analyses are easily possible with standard tools
2. Those who want more also get more!
3. There is an incredible number of data sources in software development
Important: From the problem via the data to the insight!!
Markus Harrer
innoQ Deutschland GmbH
markus.harrer@innoq.com
@feststelltaste
https://feststelltaste.de
Demos & "Slides": https://github.com/feststelltaste/software-analytics => /demos/20190228_JUG_Nuremberg
